ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark

Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, Michael R. Lyu

Department of Computer Science and Engineering, The Chinese University of Hong Kong

1155157061@link.cuhk.edu.hk {wxwang,yxwan9,lyu}@cse.cuhk.edu.hk

Tencent AI Lab

joelwxjiao@tencent.com


Abstract

arXiv:2303.13648v1 [cs.CL] 15 Mar 2023

ChatGPT is a cutting-edge artificial intelligence language model developed by OpenAI, which has attracted a lot of attention due to its surprisingly strong ability in answering follow-up questions. In this report, we aim to evaluate ChatGPT on the Grammatical Error Correction (GEC) task, and compare it with a commercial GEC product (e.g., Grammarly) and state-of-the-art models (e.g., GECToR). By testing on the CoNLL2014 benchmark dataset, we find that ChatGPT does not perform as well as those baselines in terms of the automatic evaluation metrics (e.g., F0.5 score), particularly on long sentences. We inspect the outputs and find that ChatGPT goes beyond one-by-one corrections. Specifically, it prefers to change the surface expression of certain phrases or sentence structure while maintaining grammatical correctness. Human evaluation quantitatively confirms this and suggests that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate that ChatGPT is severely under-estimated by the automatic evaluation metrics and could be a promising tool for GEC.

1 Introduction

ChatGPT¹, the current “super-star” in the artificial intelligence (AI) area, has attracted millions of registered users within just a week since its launch by OpenAI. One of the reasons for ChatGPT being so popular is its surprisingly strong performance on various natural language processing (NLP) tasks (Bang et al., 2023), including question answering (Omar et al., 2023), text summarization (Yang et al., 2023), machine translation (Jiao et al., 2023), logic reasoning (Frieder et al., 2023), code debugging (Xia and Zhang, 2023), etc. There is also a trend of using ChatGPT as a writing assistant for text polishing.


¹ https://chat.openai.com/chat

Despite the widespread use of ChatGPT, it remains unclear to the NLP community to what extent ChatGPT is capable of revising text and correcting grammatical errors. To fill this research gap, we empirically study the Grammatical Error Correction (GEC) ability of ChatGPT by evaluating it on the CoNLL2014 benchmark dataset (Ng et al., 2014), and comparing its performance to Grammarly, a prevalent cloud-based English typing assistant with 30 million users daily (Grammarly, 2023), and GECToR (Omelianchuk et al., 2020), a state-of-the-art GEC model. With this study, we aim to answer a research question:


Is ChatGPT a good tool for GEC?


To the best of our knowledge, this is the first study on ChatGPT’s ability in GEC.

We present the major insights gained from this evaluation as below:



Our evaluation indicates the limitation of relying solely on automatic evaluation metrics to assess the performance of GEC models and suggests that ChatGPT is a promising tool for GEC.


| Type | Error | Correction |
|---|---|---|
| Preposition | I sat in the talk | I sat in on the talk |
| Morphology | dreamed | dreamt |
| Determiner | I like the ice cream | I like ice cream |
| Tense/Aspect | I like play basketball | I like playing basketball |
| Syntax | I have not the book | I do not have the book |
| Punctuation | We met they talked and left | We met, they talked and left |

Table 1: Different types of error in GEC.


the general public due to its strong ability in answering various follow-up questions, correcting inappropriate questions (Zhong et al., 2023), and even refusing illegal questions. While the technical details of ChatGPT have not been released systematically, it is known to be built upon InstructGPT (Ouyang et al., 2022), which is trained using instruction tuning (Wei et al., 2022a) and reinforcement learning from human feedback (RLHF, Christiano et al., 2017).

2.2 Grammatical Error Correction

Grammatical Error Correction (GEC) is the task of correcting different kinds of errors in text, such as spelling, punctuation, grammatical, and word choice errors (Ruder, 2022). It is in high demand because writing plays an important role in academics, work, and daily life. Table 1 illustrates different grammatical errors, borrowed from the comprehensive survey on grammatical error correction by Bryant et al. (2022). In general, grammatical errors can be roughly classified into three categories: omission errors, such as "on" in the first example; replacement errors, such as "dreamed" for "dreamt" in the second example; and insertion errors, such as "the" in the third example.

To evaluate the performance of GEC, researchers have built various benchmark datasets, which include but are not limited to:


3 ChatGPT for GEC

3.1 Experimental Setup

Dataset. We evaluate the ability of ChatGPT in grammatical error correction on the CoNLL2014 task (Ng et al., 2014) dataset. The dataset is composed of short paragraphs written by non-native speakers of English, accompanied by the corresponding annotations of the grammatical errors. We sequentially pulled 100 sentences from the official-combined test set in the alternate folder of the dataset.

Evaluation Metric. To evaluate the performance of GEC, we adopt three metrics that are widely used in the literature, namely, Precision, Recall, and F0.5 score. Among them, the F0.5 score combines both Precision and Recall, with Precision assigned a higher weight (Wikipedia contributors, 2023a).
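As a minimal sketch of how the three metrics relate, the snippet below computes Precision, Recall, and the F-beta score from edit counts. The function names and the example counts are hypothetical and chosen only for illustration; this is not the official CoNLL2014 scorer, which additionally has to align system edits with the gold annotations before counting.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of proposed edits that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of gold edits that the system recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    """F-beta score; beta = 0.5 weights Precision more heavily than Recall."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical counts: 30 correct edits, 10 spurious edits, 30 missed edits.
p = precision(tp=30, fp=10)   # 0.75
r = recall(tp=30, fn=30)      # 0.5
print(f"P={p:.3f} R={r:.3f} F0.5={f_beta(p, r):.3f}")

# The formula is scale-invariant, so it also works on percentages,
# e.g. Precision 76.9 and Recall 38.5 (GECToR on short sentences).
print(round(f_beta(76.9, 38.5), 1))
```

Because beta is 0.5, a system that proposes many extra edits loses more from the drop in Precision than it gains from higher Recall.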


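| System | Precision (Short) | Recall (Short) | F0.5 (Short) | Precision (Medium) | Recall (Medium) | F0.5 (Medium) | Precision (Long) | Recall (Long) | F0.5 (Long) |
|---|---|---|---|---|---|---|---|---|---|
| GECToR | 76.9 | 38.5 | 64.1 | 68.8 | 37.5 | 58.9 | 71.8 | 38.9 | 61.5 |
| Grammarly | 62.5 | 60.6 | 62.1 | 68.9 | 56.0 | 65.9 | 67.3 | 45.3 | 61.4 |
| ChatGPT | 58.5 | 66.7 | 60.0 | 48.7 | 60.7 | 50.7 | 51.0 | 62.8 | 53.0 |

Table 3: GEC performance with respect to sentence length.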


Specifically, the three metrics are expressed as:

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \tag{2}$$

$$F_{0.5} = \frac{1.25 \times \text{Precision} \times \text{Recall}}{0.25 \times \text{Precision} + \text{Recall}}, \tag{3}$$

where TP, FP and FN represent the true positives, false positives and false negatives of the predictions, respectively. We use the official scoring program provided by CoNLL2014 but adapt it to be compatible with the latest Python environment.

Baselines. In this report, we perform the GEC task on three systems, including:

... the grammar correction in the setting and only ask it to correct the ones with correctness problems (red underline), while leaving the clarity (blue underline), engagement (green underline) and delivery (purple underline) unchanged. We iterate this process several times until there is no error detected by Grammarly.

• GECToR: Besides Grammarly, we also compare

3.2 Results and Analysis

Overall Performance. Table 2 presents the overall performance of the three systems. As seen, ChatGPT obtains the highest recall, GECToR obtains the highest precision, while Grammarly achieves a better balance between the two metrics and thus the highest F0.5 score. These results suggest that ChatGPT tends to correct as many errors as possible, which may lead to more over-corrections. In contrast, GECToR corrects only those errors it is confident about, which leaves many errors uncorrected. Grammarly combines the advantages of both, so it performs more stably.

ChatGPT Performs Worse on Long Sentences? To understand which kind of sentences ChatGPT is good at, we divide the 100 test sentences into three equally sized categories, namely, Short, Medium and Long. Table 3 shows the results with respect to sentence length. As seen, the gap between ChatGPT and Grammarly narrows considerably on short sentences. In contrast, ChatGPT performs much worse on longer sentences, at least in terms of the existing evaluation metrics.
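The length-based split can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how length is measured or how the remainder of 100 / 3 is distributed, so the sketch assumes whitespace token counts and gives the leftover sentence to the first bucket. The function name `split_by_length` is hypothetical.

```python
def split_by_length(sentences, n_buckets=3):
    """Sort sentences by token count and cut them into near-equal buckets.

    Assumption: length is the number of whitespace-separated tokens.
    Any remainder is distributed to the earliest buckets.
    """
    ranked = sorted(sentences, key=lambda s: len(s.split()))
    size, extra = divmod(len(ranked), n_buckets)
    buckets, start = [], 0
    for i in range(n_buckets):
        end = start + size + (1 if i < extra else 0)
        buckets.append(ranked[start:end])
        start = end
    return buckets


# Toy data standing in for the 100 test sentences.
toy = ["a b", "a", "a b c d", "a b c", "a b c d e f", "a b c d e"]
short, medium, long_ = split_by_length(toy)
print(short, medium, long_)
```

Each bucket can then be scored separately with the same metrics, which is how a per-length breakdown such as Table 3 would be produced.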

ChatGPT Goes Beyond One-by-One Corrections. We inspect the output of the three systems, especially those for long sentences, and find that ChatGPT is not limited to correcting errors in a one-by-one fashion. Instead, it is more willing to change the surface expression of some phrases or the sentence structure. For example, in Table 4, GECToR and Grammarly make minor changes to the source sentence (i.e., “an example” to “example”, “family potential disease” to “a family ’s potential disease”), while ChatGPT modifies the sentence structure (i.e., “for family potential disease” to “in preventing potential family diseases”) and word choice (i.e., “chances” to “opportunities”). This indicates that the outputs of ChatGPT maintain grammatical correctness, although they do not follow the original expression of the source sentences.

| System | Sentence |
|---|---|
| Source | For an example , if exercising is helpful for family potential disease , we can always look for more chances for the family to go exercise . |
| Reference | For example , if exercising (OR exercise) is helpful for a potential family disease , we can always look for more chances for the family to do exercise . |
| GECToR | For example , if exercising is helpful for family potential disease , we can always look for more chances for the family to go exercise . |
| Grammarly | For example , if exercising is helpful for a family ’s potential disease , we can always look for more chances for the family to go exercise . |
| ChatGPT | For example , if exercise is helpful in preventing potential family diseases , we can always look for more opportunities for the family to exercise . |

Table 4: Comparison of the outputs from different GEC systems.

To validate our hypothesis, we let Grammarly further correct the grammatical errors in the outputs of GECToR and ChatGPT. Table 5 lists the results. We can observe that Grammarly introduces a negligible improvement to the output of ChatGPT, demonstrating that ChatGPT indeed generates correct sentences. In contrast, Grammarly noticeably improves the performance of GECToR (i.e., +2.1 F0.5, +16.5 Recall), suggesting that there are still many errors in the output of GECToR.

| System | Precision | Recall | F0.5 |
|---|---|---|---|
| GECToR | 71.2 | 38.4 | 60.8 |
| + Grammarly | -5.9 | +16.5 | +2.1 |
| ChatGPT | 51.2 | 62.8 | 53.1 |
| + Grammarly | +0.4 | +0.8 | +0.5 |

Table 5: GEC performance with Grammarly for further correction.

| System | #Under | #Mis | #Over |
|---|---|---|---|
| GECToR | 13 | 4 | 0 |
| Grammarly | 14 | 0 | 1 |
| ChatGPT | 3 | 3 | 30 |

Table 6: Number of under-correction (Under), mis-correction (Mis) and over-correction (Over) produced by different GEC systems.
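As a small consistency check on the Table 5 figures (base scores plus the "+ Grammarly" deltas), the sketch below converts the deltas into absolute Precision and Recall and recomputes F0.5 from them. The dictionary layout and the helper name `after_grammarly` are illustrative assumptions. Because the inputs are already rounded to one decimal, a recomputed score may occasionally differ from a reported one in the last digit.

```python
def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    """F-beta score on any consistent scale (fractions or percentages)."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# (Precision, Recall) before Grammarly and the reported changes after it.
reported = {
    "GECToR": {"base": (71.2, 38.4), "delta": (-5.9, 16.5)},
    "ChatGPT": {"base": (51.2, 62.8), "delta": (0.4, 0.8)},
}


def after_grammarly(base, delta):
    """Apply the deltas and recompute F0.5 from the resulting P and R."""
    p = base[0] + delta[0]
    r = base[1] + delta[1]
    return p, r, f_beta(p, r)


for system, row in reported.items():
    p, r, f = after_grammarly(row["base"], row["delta"])
    print(f"{system} + Grammarly: P={p:.1f} R={r:.1f} F0.5={f:.1f}")
```

The recomputed F0.5 values, 62.9 for GECToR and 53.6 for ChatGPT, agree with the reported base scores plus deltas (60.8 + 2.1 and 53.1 + 0.5).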






Human Evaluation. We conduct a human evaluation to further demonstrate the potential of ChatGPT for the GEC task. Specifically, we follow Wang et al. (2022) to manually annotate the issues in the outputs of the three systems, including 1) Under-correction, i.e., grammatical errors that are not found; 2) Mis-correction, i.e., grammatical errors that are found but modified incorrectly, where the result can be either grammatically or semantically incorrect; 3) Over-correction, i.e., other modifications beyond the changes in the reference. We sample 20 sentences out of the 100 test sentences and ask two annotators to identify the issues. Table 6 shows the results. ChatGPT has the fewest under-corrections among the three systems and fewer mis-corrections than GECToR, which suggests its great potential in grammatical error correction. Meanwhile, ChatGPT produces more over-corrections, which may come from its diverse generation ability as a large language model. While this usually leads to a lower F0.5 score, it also allows more flexible language expressions in GEC.
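To make the link between over-correction and a lower F0.5 concrete, here is a rough sketch that extracts token-level edits with Python's standard `difflib` and compares a system output against a reference. It is an assumption-laden illustration, not the manual annotation protocol used above nor the CoNLL2014 scorer: edits are matched by exact span and replacement text, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def token_edits(source: str, target: str) -> set:
    """Return edits as (start, end, replacement) over whitespace tokens of source."""
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.add((i1, i2, " ".join(tgt[j1:j2])))
    return edits


def compare_to_reference(source: str, reference: str, hypothesis: str) -> dict:
    """Split system edits into matched, missed (reference only) and extra (system only)."""
    ref = token_edits(source, reference)
    hyp = token_edits(source, hypothesis)
    return {"matched": hyp & ref, "missed": ref - hyp, "extra": hyp - ref}


source = "I like play basketball"
reference = "I like playing basketball"
hypothesis = "I like playing basketball today"  # one correct edit, one extra edit

result = compare_to_reference(source, reference, hypothesis)
print(result)
```

Under exact matching, "extra" edits count as false positives even when the resulting sentence is grammatical, which is why fluent rewrites are penalized. Note that a mis-correction would appear here as one "missed" plus one "extra" edit, so separating it from over-correction still requires human judgment.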

Discussions. We have checked the outputs corresponding to the results of Table 5, and observed different behaviors of ChatGPT and Grammarly. The slight improvement (i.e., +0.5 F0.5) by Grammarly mainly comes from punctuation problems. ChatGPT is not sensitive to punctuation problems but Grammarly is, though its modifications are not always correct. For example, when we manually undo the corrections on punctuation, the F0.5 score increases by +0.0015. Other than punctuation problems, Grammarly also corrects a few grammatical errors on articles, prepositions, and plurals. However, these corrections usually require Grammarly to repeat the process twice. Take the following sentence as an example,


      ... constructs of the family and kinship are a social construct,

      ...


      Grammarly first changes it to


      ... constructs of the family and kinship are a social constructs,

      ...


Then, it changes it to


      ... constructs of the family and kinship are social constructs,

      ...


Nonetheless, it does correct some errors that ChatGPT fails to correct.


4 Conclusion

This paper evaluates ChatGPT on the task of Grammatical Error Correction (GEC). By testing on the CoNLL2014 benchmark dataset, we find that ChatGPT performs worse than the commercial product Grammarly and the state-of-the-art model GECToR in terms of automatic evaluation metrics. By examining the outputs, we find that ChatGPT displays a unique ability to go beyond one-by-one corrections by changing surface expressions and sentence structure while maintaining grammatical correctness. Human evaluation results confirm this finding and reveal that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate the limitation of relying solely on automatic evaluation metrics to assess the performance of GEC models and suggest that ChatGPT has the potential to be a valuable tool for GEC.

Limitations and Future Works

There are several limitations in this version, which we leave for future work:


References

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. ArXiv.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. NeurIPS.

Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe. 2019. The bea-2019 shared task on grammatical error correction. In BEA@ACL.

Christopher Bryant, Zheng Yuan, Muhammad Reza Qorib, Hannan Cao, Hwee Tou Ng, and Ted Briscoe. 2022. Grammatical error correction: A survey of the state of the art. ArXiv.

Paul Francis Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. NeurIPS.

Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and J J Berner. 2023. Mathematical capabilities of chatgpt. ArXiv.


Peiyuan Gong, Xuebo Liu, Heyan Huang, and Min Zhang. 2022. Revisiting grammatical error correction evaluation and beyond. EMNLP.

Grammarly. 2023. Grammarly website about us page.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? a preliminary study. ArXiv.


Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The conll-2014 shared task on grammatical error correction. In CoNLL.


Reham Omar, Omij Mangukiya, Panos Kalnis, and Essam Mansour. 2023. Chatgpt versus traditional question answering for knowledge graphs: Current status and future directions towards knowledge graph chatbots. ArXiv.


Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem N. Chernodub, and Oleksandr Skurzhanskyi. 2020. Gector – grammatical error correction: Tag, not rewrite. In Workshop on Innovative Use of NLP for Building Educational Applications.


Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. ArXiv.

Sebastian Ruder. 2022. NLP-progress.


Joel R. Tetreault, Keisuke Sakaguchi, and Courtney Napoles. 2017. Jfleg: A fluency corpus and benchmark for grammatical error correction. In EACL.


Wenxuan Wang, Wenxiang Jiao, Yongchang Hao, Xing Wang, Shuming Shi, Zhaopeng Tu, and Michael Lyu. 2022. Understanding and improving sequence-to-sequence pretraining for neural machine translation. In ACL.


Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022a. Finetuned language models are zero-shot learners. ICLR.


Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai hsin Chi, Quoc Le, and Denny Zhou. 2022b. Chain of thought prompting elicits reasoning in large language models. NeurIPS.


Wikipedia contributors. 2023a. F-score — Wikipedia, the free encyclopedia. [Online; accessed 5-March-2023].

Wikipedia contributors. 2023b. Grammarly — Wikipedia, the free encyclopedia. [Online; accessed 2-March-2023].

Chun Xia and Lingming Zhang. 2023. Conversational automated program repair. ArXiv.

Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen, and Wei Cheng. 2023. Exploring the limits of chatgpt for query or aspect-based text summarization. ArXiv.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023. Can chatgpt understand too? a comparative study on chatgpt and fine-tuned bert. ArXiv.